Overview


Midori is written almost entirely in C# 
* Use C# for its developer productivity and reliability advantages 
* Basis for Midori's concurrency safety and lightweight processes 


Today's talk 

* How to compile C# so that you can use it to write an OS

* A 35,000 foot overview of Bartok, the compiler underlying Midori 
* Will schedule future talks based on researcher interest 


Approaches 


Compile ahead-of-time 


Reduce the memory and time overheads of managed code 
Top 5 features: 

* Generic sharing 

* Shared libraries 

* Class initialization at process start-up 

* Frozen objects 

* Efficient linked stacks for concurrency 

Will discuss shared library implementation in more depth 


Highly optimizing compiler 
* Same optimization capabilities as production C++ compilers
* Extend for managed code 


Midori OS 


Runs on x86, x64, ARM

* Small, policy-free microkernel (C++) for HW interaction
* Domain kernel (C#) contains most policy: schedulers, memory, resource management
* Most of the system
* All 3rd party code, including device drivers as processes

Isolation in depth
* Software isolation between processes (type-safety)
* [...]re isolation between domains for untrusted [...]
* Type safety, capability-based security

[Figure: Domain 1 and Domain 2; labels: Domain Services, Midori App, Domain Kernel]


Case Study: SPECWeb05 (from March ‘13 Midori talk) 


Midori can deliver performance competitive with Windows 


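[Table: SPECWeb05 results; row and column labels are lost. Surviving values: [...]800; 70,600; 27,300; 47,000; 49,461; 78,442]

Hardware: 2x 10Gb NIC + 2x 1Gb NIC
[...] and Midori runs per benchmark rules.

Windows settings: official Windows submissions on SPEC.org, except for custom [...] that improves Windows performance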


Ahead-of-time compilation 
Compiler creates stand-alone executables: 


[Figure: build pipeline; legible labels: Bartok, linker, executable]


Instantiate all generic types at compile time 


For core system, drop C# features that rely on a JIT:
* Dynamic class loading, reflection, unbounded polymorphic recursion, generic virtual methods


Still use a JIT for dynamic scripting languages 


Ahead-of-time compilation 


Advantages 

* Faster machine code — time/memory for optimization not highly constrained 
* Avoids memory and runtime cost of JIT 

* Reuse existing OS approaches (for example, debugging, crash dumps) 


Disadvantages 
* Longer compile times change developer experience from .NET 


Ahead-of-time compiled generics 


Bartok, like CLR, uses a non-uniform representation strategy 
Value types have many different sizes 
Specialize generic for each different layout 
Avoids unexpected boxing of values (heap allocation) in systems code 


CLR lazily instantiates generic types 
Relies on JIT to build specialized generic code “as needed” 
NGEN can always fall back on this 


Challenge for ahead-of-time compilation: instantiating all generic 
types at compile time 
Was not clear how well this would work: large number of generics possible 
Interacts with library story (shared code) 


Example 


public class HashMap<TKey, TValue> : Map<TKey, TValue>
{
    readonly HashingComparer<TKey> m_comparer;
    MapEntry<TKey, TValue>[] m_entries =
        EmptyArray<MapEntry<TKey, TValue>>.Instance;
    int m_count;
    int[] m_buckets;

    public override bool TryGet(TKey key, out TValue value) { ... }
}

struct MapEntry<TKey, TValue>
{
    public int HashCode;
    public int NextEntry;
    public TKey Key;
    public TValue Value;
}

HashMap<string,int> creates these types:

Map<string,int>, MapEntry<string,int>,
HashingComparer<string>,
EmptyArray<MapEntry<string,int>>
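The list of types above can be read as the closure of a worklist that starts at HashMap<string,int> and follows every type the generic body mentions. A minimal sketch, assuming a toy encoding of generic definitions: the `REFERENCES` table, `instantiate_all`, and the depth cutoff are hypothetical names for illustration, not Bartok's data structures. The cutoff is only a crude stand-in for a real finiteness proof.

```python
# Toy model: a type is a tuple (name, arg1, ...); a bare string is a
# non-generic type or a type-parameter name.
# REFERENCES maps a generic definition to (parameter names, type expressions
# its body mentions). The table contents are an illustrative assumption.
REFERENCES = {
    "HashMap": (("TKey", "TValue"), [
        ("Map", "TKey", "TValue"),
        ("MapEntry", "TKey", "TValue"),
        ("HashingComparer", "TKey"),
        ("EmptyArray", ("MapEntry", "TKey", "TValue")),
    ]),
    "Map": (("TKey", "TValue"), []),
    "MapEntry": (("TKey", "TValue"), []),
    "HashingComparer": (("TKey",), []),
    "EmptyArray": (("T",), []),
}


def substitute(expr, env):
    """Replace type parameters in expr with the concrete arguments in env."""
    if isinstance(expr, str):
        return env.get(expr, expr)
    return (expr[0],) + tuple(substitute(a, env) for a in expr[1:])


def depth(t):
    """Nesting depth of a type expression; bare names have depth 0."""
    if isinstance(t, str):
        return 0
    return 1 + max((depth(a) for a in t[1:]), default=0)


def instantiate_all(roots, references=REFERENCES, max_depth=16):
    """Collect every instantiation reachable from roots.

    Raises ValueError when nesting keeps growing, which is how unbounded
    polymorphic recursion shows up in this toy model.
    """
    seen, work = set(), list(roots)
    while work:
        inst = work.pop()
        if isinstance(inst, str) or inst in seen:
            continue
        if depth(inst) > max_depth:
            raise ValueError("possibly unbounded instantiation: %r" % (inst,))
        seen.add(inst)
        params, refs = references[inst[0]]
        env = dict(zip(params, inst[1:]))
        work.extend(substitute(r, env) for r in refs)
        # Type arguments that are themselves instantiations are needed too.
        work.extend(a for a in inst[1:] if not isinstance(a, str))
    return seen


created = instantiate_all([("HashMap", "string", "int")])
```

Here `created` holds the root plus the four types listed on the slide. A definition such as Rec<T> that mentions Rec<Wrap<T>> never reaches a fixed point and is rejected by the cutoff.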


MapEntry<string,int> size 20 bytes (on x64)
HashCode: offset 0, size 4 
NextEntry: offset 4, size 4 
Key: offset 8, size 8, GC pointer 
Value: offset 16, size 4 


HashMap<string,string> creates these types:

Map<string,string>, MapEntry<string,string>,
HashingComparer<string>,
EmptyArray<MapEntry<string,string>>

MapEntry<string,string> size 24 bytes (on x64)
HashCode: offset 0, size 4
NextEntry: offset 4, size 4
Key: offset 8, size 8, GC pointer
Value: offset 16, size 8, GC pointer
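The two layouts above follow from placing fields in declaration order. A small sketch, assuming each field is aligned to its own size and no tail padding is added (which matches the 20-byte figure on the slide); `SIZES`, `layout`, and `map_entry` are hypothetical helper names.

```python
# Assumed x64 sizes: (size in bytes, is it a GC pointer?).
SIZES = {"int": (4, False), "string": (8, True)}


def layout(fields):
    """Place fields in declaration order, aligning each to its own size.

    Returns ([(name, offset, size, is_gc_pointer), ...], total_size).
    No tail padding is added in this sketch.
    """
    offset, placed = 0, []
    for name, type_name in fields:
        size, is_gc = SIZES[type_name]
        offset = (offset + size - 1) // size * size  # round up to alignment
        placed.append((name, offset, size, is_gc))
        offset += size
    return placed, offset


def map_entry(tkey, tvalue):
    """Layout of MapEntry<tkey, tvalue> as declared in the example."""
    return layout([
        ("HashCode", "int"),
        ("NextEntry", "int"),
        ("Key", tkey),
        ("Value", tvalue),
    ])


fields_si, size_si = map_entry("string", "int")     # MapEntry<string,int>
fields_ss, size_ss = map_entry("string", "string")  # MapEntry<string,string>
```

The two instantiations differ in size and in which offsets hold GC pointers, which is why they need separately specialized code.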


Implementing generics 


Prove that number of generic instantiations is finite 


Each generic instantiation gets its own vtable 

C# has runtime type identity 

VTables can be costly, rivalling machine code in size in an executable 
Bartok merges VTables when it can prove type identity does not matter 


Structurally identical instantiations share code for methods (generic sharing). 
Bartok uses type passing (dictionaries) like CLR does for runtime operations dependent 
on type (e.g. new(T)). 

Sharing decision made on a per-method basis. 


Share code for instantiations with value type args, in addition to reference type args 
Value types popular in Midori systems programming (no heap allocation) 
Same size, alignment, GC pointers, and calling convention. 
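The sharing criterion in the last line can be sketched as grouping instantiations by a key built from those four properties. This is a simplified illustration: the `TYPE_INFO` descriptors and the register-class labels are assumptions, and it groups whole instantiations even though the slides say the real decision is made per method.

```python
from collections import defaultdict

# Hypothetical descriptors per type argument:
# (size, alignment, GC pointer map, calling-convention class).
TYPE_INFO = {
    "string": (8, 8, (True,), "int_reg"),
    "object": (8, 8, (True,), "int_reg"),
    "int": (4, 4, (False,), "int_reg"),
    "uint": (4, 4, (False,), "int_reg"),
    "float": (4, 4, (False,), "float_reg"),
}


def sharing_key(generic_name, type_args):
    """Instantiations with equal keys are structurally identical here."""
    return (generic_name,) + tuple(TYPE_INFO[a] for a in type_args)


def group_for_sharing(instantiations):
    """Group (name, args) pairs whose code could be shared."""
    groups = defaultdict(list)
    for name, args in instantiations:
        groups[sharing_key(name, args)].append((name, args))
    return list(groups.values())


groups = group_for_sharing([
    ("List", ("string",)),
    ("List", ("object",)),
    ("List", ("int",)),
    ("List", ("uint",)),
    ("List", ("float",)),
])
```

In this model the two reference-type instantiations share one copy, the two 4-byte integer instantiations share another, and the float instantiation stays separate because its calling convention differs.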


Empirical evaluation of generics 


Compare native executable sizes to input MSIL sizes 
MSIL has one copy of the code for a generic type 

Native code has specialized (instantiated) copies 

Evaluate ratio for 3 Midori builds 

Generics are used: generic methods are about 30% of native code size 


                 Native/MSIL ratio   Native (bytes)   MSIL (bytes)
Release-x64      3.24                197,595,648      60,954,624
Release-x86      2.73                165,058,048      60,428,288
Release-Tegra3   2.33                111,294,976      47,695,872




Shared libraries 


Programmers expect to have large, rich class libraries 
How do we have lots of small processes then? 
Midori new process memory footprint: 145 Kbytes 


Solution: have libraries that are shared across many processes 
Code is loaded once and used many times, which amortizes memory footprint of code 


To provide the rich experience, allow any OO type to be exported 
Including generics, which complicate things. 

Windows exports flat interfaces 

Midori trade-off: library changes may be breaking changes at the binary level. 


App stores make this reasonable to consider 
Can recompile apps if library changes 


Midori shared libraries 


Currently 21 shared libraries. Some examples for x64: 


Name                 Description          MSIL size    Native size
PlatformCore         Core functionality   8,479,000    11,228,160
PlatformGraphics     Graphics stack       6,343,168    4,293,120
PlatformNET          Networking           602,624      2,306,650
PlatformStorage      File system          1,239,040    5,480,960
PlatformWebRuntime   HTML rendering       3,423,232    8,045,568


Compilation model 


Applications/libraries are brittle with respect to libraries they use 
* Application/library may need a new generic instantiation for a value type 

* Application/library may subclass a type from a library 

* If a library changes, its consumers must be recompiled


This cuts both ways, though: 

We use information about libraries during optimization 
* Cross-compilation unit inlining of method bodies 

* Existing generic methods 


Class initialization at process start-up 


C# class constructor semantics lead to runtime checks 

Lazy class initialization has checks before field accesses and static/instance method calls.
"Before field init" has checks before field accesses

These checks don't exist for C++ and lead to "peanut butter costs"


To avoid these checks, Midori initializes all classes at process start up. 
Static constructors have no capabilities in Midori, so they can't do I/O 

Compiler analysis decides on class initialization order 

* Build dependency graph for class constructor code 

* Do topological sort, reject compilation units with cycles 


* Worst-case assumptions about methods that "escape" to prior compilation unit during 
class construction 
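The ordering step above can be sketched as a depth-first topological sort that rejects cycles. A minimal illustration, assuming the dependency graph has already been built; `schedule_class_constructors` and the class names in the example are hypothetical.

```python
def schedule_class_constructors(deps):
    """Return an initialization order for class constructors.

    deps maps a class name to the set of classes whose constructors must
    run first. Raises ValueError if the dependencies form a cycle, which
    models rejecting the compilation unit.
    """
    order, state = [], {}  # state: 1 = in progress, 2 = done

    def visit(cls, path):
        if state.get(cls) == 2:
            return
        if state.get(cls) == 1:
            raise ValueError(
                "class constructor cycle: " + " -> ".join(path + [cls]))
        state[cls] = 1
        for dep in sorted(deps.get(cls, ())):
            visit(dep, path + [cls])
        state[cls] = 2
        order.append(cls)  # all dependencies are already in the list

    for cls in sorted(deps):
        visit(cls, [])
    return order


order = schedule_class_constructors({
    "Logger": {"Config"},
    "Config": set(),
    "Cache": {"Config", "Logger"},
})
# order == ["Config", "Logger", "Cache"]
```

Sorting the keys makes the schedule deterministic, which matters if the list is exported for later compilation units to extend.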


Static Constructor Scheduling and Shared Libraries 


[Figure: libraries L1 and L2; L1 exports its ordered list of static constructors via a file]

Schedule for L2 computed incrementally.
All static constructors for L1 are run before L2.
At process startup, special runtime code invokes the list of method pointers.


Frozen objects 


Statically-initialized readonly data is valuable 

* Improve process start up times: avoid cost of running class constructors 
* Reduce memory footprint: share data across all processes using library 
* GC friendly 


Try to evaluate each class constructor at compile time 
Results in: 


* Initialized static fields pointing to a graph of objects in read-only memory (or containing immutable value types)
* Or a decision to defer initialization of that constructor to process start up.


Optimization depends on Midori's TSE type system 


Allows programmers to specify immutability of object graphs and fields


Compile-time code execution 


Only done when optimizations are enabled 


Interprets code of static constructors at compile time 
* It doesn't process MSIL: Bartok IR is used 
* Platform-independent evaluation 


* Interpreter is comprehensive: only native code calls, exceptions, some unsafe code 
operations aren't supported 


Compiler loads types (including code) from libraries that are 
reachable from class constructors 


Done after creation of generic instantiations and shared methods 
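The "evaluate or defer" decision can be sketched with a tiny interpreter that gives up on anything it does not support. The IR below (`const`, `add`, `new_array`, `store_static`) is a made-up toy, not Bartok IR, and `freeze_or_defer` is a hypothetical name.

```python
class Unsupported(Exception):
    """Raised when the toy interpreter meets an op it cannot evaluate."""


def evaluate_constructor(ops):
    """Interpret toy IR ops and return the static field values written."""
    regs, statics = {}, {}
    for op in ops:
        kind = op[0]
        if kind == "const":
            _, dst, value = op
            regs[dst] = value
        elif kind == "add":
            _, dst, a, b = op
            regs[dst] = regs[a] + regs[b]
        elif kind == "new_array":
            _, dst, srcs = op
            regs[dst] = tuple(regs[s] for s in srcs)  # tuple models frozen data
        elif kind == "store_static":
            _, field, src = op
            statics[field] = regs[src]
        else:
            raise Unsupported(kind)  # e.g. a native code call
    return statics


def freeze_or_defer(constructors):
    """Split constructors into compile-time results and deferred ones."""
    frozen, deferred = {}, []
    for cls, ops in constructors.items():
        try:
            frozen[cls] = evaluate_constructor(ops)
        except Unsupported:
            deferred.append(cls)  # run at process start-up instead
    return frozen, deferred


frozen, deferred = freeze_or_defer({
    "Primes": [
        ("const", "a", 2), ("const", "b", 3), ("add", "c", "a", "b"),
        ("new_array", "t", ["a", "b", "c"]), ("store_static", "Table", "t"),
    ],
    "Clock": [("native_call", "now")],
})
```

`Primes` ends up with a frozen table, while `Clock` is deferred because its constructor calls native code.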


Efficient linked stacks 


Midori can have thousands of async computations in flight at once. 
* Can't have a large stack per async computation 


* Solution is to break stacks into segments and link them together dynamically [Von 
Behren, SOSP '03] 


* Explicit stack checks and linking can cost 2-5% of performance (more peanut butter) 


Improve linked stacks using asymmetry between async vs. sync 
methods 

Type system differentiates between the two kinds of methods 

Synchronous methods can never block 

Execute async methods on linked stacks 


For synchronous methods, switch to a large stack with a guard page 
No checks needed! 
Linked stack optimization


Modified calling convention 

* Caller must provide a minimum amount of stack space 

* Optimizes small functions 

Minimize stack check placements 

* Push checks up call graph using interprocedural analysis to calculate max stack 
required 

* Propagate requirements across libraries 

Optimize switching between stacks 

* Stack switch injection driven by register allocator 

* Treat stack pointer as an implicit call argument that has one of two possible values 


* Register allocator minimizes dynamic cost of reloading values to registers, so it places the reload (switch) efficiently
* Important runtime methods are stack neutral, no switch necessary
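Pushing checks up the call graph amounts to computing, for each method, the worst-case stack it and its callees need, so one check at the top can cover the whole subtree. A minimal sketch under the assumption of an acyclic call graph; the function and method names are hypothetical.

```python
def max_stack_required(frame_size, calls, root):
    """Worst-case stack bytes needed by root and everything it calls.

    frame_size maps a method to its own frame size in bytes; calls maps a
    method to the methods it may call. Recursion raises ValueError in this
    sketch, since no finite bound exists for it.
    """
    memo, active = {}, set()

    def need(fn):
        if fn in memo:
            return memo[fn]
        if fn in active:
            raise ValueError("recursive call through " + fn)
        active.add(fn)
        deepest = max((need(c) for c in calls.get(fn, ())), default=0)
        active.discard(fn)
        memo[fn] = frame_size[fn] + deepest
        return memo[fn]

    return need(root)


frame_size = {"Handle": 64, "Parse": 128, "Log": 32, "Format": 48}
calls = {"Handle": ["Parse", "Log"], "Log": ["Format"]}
required = max_stack_required(frame_size, calls, "Handle")
# required == 192: Handle (64) plus its deepest callee path, Parse (128)
```

With this number, a single check for 192 bytes on entry to `Handle` would make separate checks in `Parse`, `Log`, and `Format` unnecessary.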


Implementing shared libraries 


A library is a set of MSIL assemblies compiled into one native image 
Defines methods, static fields, read-only data, and runtime metadata for types 


Static fields: 
* Need per-process instance of field. 
* Midori processes are in a single address space 
* Implement in software using a CPU environment block 


[Figure: CPU Environment Block with per-library slots pointing to the static data of Library A and Library B]

Example code:

// Reading a field in non-generic class D in A
base_A = FS:[s_A]
value = [base_A + field_offset]


Slot numbers need to be global across domain 
A library needs to find its static data at the same slot in every process


Slot numbering isn't known at compile time 
Number of libraries on deployed system isn't known 


Would have to waste per-process space on libraries potentially not in use. 


Loader numbers libraries, embeds numbers into code during loading. 
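The slot scheme above can be modeled in a few lines. This is a rough sketch, not the actual loader: `Loader`, `Process`, and `read_static` are hypothetical names, a dictionary stands in for the CPU environment block, and patching the slot number into code is modeled by simply returning it.

```python
class Loader:
    """Assigns each library a domain-wide slot number when first loaded."""

    def __init__(self):
        self.slot_of = {}

    def load(self, library):
        # Slots are handed out in load order, so libraries that are never
        # loaded do not reserve a slot.
        return self.slot_of.setdefault(library, len(self.slot_of))


class Process:
    """Per-process environment block: slot number -> library static data."""

    def __init__(self, loader):
        self.loader = loader
        self.env_block = {}

    def use(self, library, static_size):
        slot = self.loader.load(library)
        self.env_block[slot] = bytearray(static_size)  # per-process copy
        return slot


def read_static(process, slot, field_offset):
    base = process.env_block[slot]  # models: base = FS:[slot]
    return base[field_offset]       # models: value = [base + field_offset]


loader = Loader()
p1, p2 = Process(loader), Process(loader)
slot_a = p1.use("A", 16)
slot_b1 = p1.use("B", 16)
slot_b2 = p2.use("B", 16)   # same slot number as in p1
p1.env_block[slot_b1][4] = 7
```

Library B gets the same slot in both processes, but each process has its own storage, so the write in `p1` is not visible in `p2`.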
Accessing code and read-only data in library


Everything is in same address space 
* Suppose executable E uses library L 
* Loader fixes up E to point directly to the code/data for L in memory 


[Figure: image E pointing directly at code/data in library L]
Libraries and generics
Who defines an instantiation used in multiple libraries? 


Design choice: self-contained libraries and executables 
New instantiations always get code and vtables 
Multiple copies of code, vtables can exist at runtime 
Re-use instantiations from prior compilation units 
Simplifies system 
* No system-wide information / tables 
* Can always unload unused libraries 


[Figure: Corlib defines G<T>; libraries built on Corlib each use instantiations of it]

Identify potentially multiply instantiated (PMI) types
"Pay as you go": only they pay when possible
Instantiation is not a PMI if any type syntactically occurring in it is defined in current compilation unit
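The PMI rule in the last line is a purely syntactic check and can be sketched directly. The tuple encoding of types and the names `types_occurring` and `is_pmi` are assumptions for illustration.

```python
def types_occurring(inst):
    """Yield every type name that appears syntactically in an instantiation.

    An instantiation is a tuple (generic_name, arg1, ...); a bare string is
    a non-generic type.
    """
    if isinstance(inst, str):
        yield inst
        return
    yield inst[0]
    for arg in inst[1:]:
        yield from types_occurring(arg)


def is_pmi(inst, defined_in_current_unit):
    """True if no type occurring in inst is defined in the current unit."""
    return not any(t in defined_in_current_unit
                   for t in types_occurring(inst))


# G is defined elsewhere (say, in Corlib); C is defined in the current unit.
g_of_c_here = is_pmi(("G", "C"), {"C"})        # False: C is ours
g_of_c_elsewhere = is_pmi(("G", "C"), {"D"})   # True: neither G nor C is ours
```

The intuition is that an instantiation mentioning a type from the current unit cannot have been created by any earlier unit, so only one copy of it can exist.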


PMI implementation 


Type tests 
* Use vtable pointer equality first 
* If that fails, use expensive structural type test 
Use per-type hashcode to speed up negative structural type tests 
Have to check for PMIs along some fast paths 
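The three-step type test above can be sketched as follows. `VTable` and `same_type` are hypothetical names, and Python's built-in `hash` stands in for the per-type hash code.

```python
class VTable:
    """One runtime copy of a type's vtable; a PMI type may have several."""

    def __init__(self, structure):
        self.structure = structure        # e.g. ("G", "C")
        self.type_hash = hash(structure)  # per-type hash, computed once


def same_type(a, b):
    """Type identity test that tolerates multiple vtable copies."""
    if a is b:                       # fast path: vtable pointer equality
        return True
    if a.type_hash != b.type_hash:   # cheap negative answer
        return False
    return a.structure == b.structure  # expensive structural comparison


copy_in_a = VTable(("G", "C"))   # G<C> as instantiated in one library
copy_in_b = VTable(("G", "C"))   # G<C> as instantiated in another
other = VTable(("G", "D"))
```

`same_type(copy_in_a, copy_in_b)` is true even though the pointers differ. Comparing against `other` is normally rejected by the hash comparison before any structural walk.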


Static fields 

* Add another level of indirection to PMI field access 

* When compiling an executable, compiler: 
* For each PMI instantiation, picks one occurrence as the representative instantiation 
* Sets up per-library tables of representative instantiations 
* Used at class initialization to initialize the extra indirection 
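The extra indirection for PMI static fields can be modeled as every copy holding a pointer to one representative's storage. A minimal sketch: the class and function names are hypothetical, and picking the first copy as the representative is an assumption.

```python
class InstantiationCopy:
    """One library's copy of a PMI instantiation such as G<C>."""

    def __init__(self, library):
        self.library = library
        self.storage = {}      # this copy's own space for static fields
        self.indirect = None   # extra indirection, set at class init


def initialize_indirections(copies):
    """Point every copy at the representative's storage."""
    representative = copies[0]  # assumption: first occurrence wins
    for copy in copies:
        copy.indirect = representative.storage
    return representative


def write_pmi_static(copy, field, value):
    copy.indirect[field] = value


def read_pmi_static(copy, field):
    return copy.indirect[field]


in_a, in_b = InstantiationCopy("A"), InstantiationCopy("B")
rep = initialize_indirections([in_a, in_b])
write_pmi_static(in_b, "Count", 3)
seen_from_a = read_pmi_static(in_a, "Count")   # 3
```

Code in either library sees the same field value, and the non-representative copy's own storage stays unused.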


PMI static fields 


Here G<C> is a PMI: 


[Figure: CPU Environment Block slots for Library A and Library B; each library's statics contain G<D> space, a G<C> pointer, and G<C> space]

// Accessing a multiply-instanced field in G<C>
base_A = FS:[s_A]
indirect_GC = [base_A + GC_A + pointer_offset]
value = [indirect_GC]
Optimizing compiler


Bartok is the fusion of two optimizing compilers 
* Original MSR compiler, used by [...]
* The Phoenix compiler infrastructure, originally developed in DevDiv
* Phoenix can be used to compile C# and C++ code
* Phoenix used to compile Midori C++ code

[Figure: compiler pipeline; legible stage labels: Reader, High-level Optimizer, Interprocedural Opts, Global Optimizer]

Optimizing compiler 


Modes 
* Library
  * Assume everything public or protected is called
* Executable
  * Assume only entry points are called externally
  * May use information from library compilation
* Whole program
  * Assume only entry points are called
  * Assume all code is known


Profile-guided optimization 
* Orthogonal to the mode 

* Done only for Phoenix phases 

* All phases are profile aware 
Optimizations


Intraprocedural
* Expression simplification
* Expression reshaping
* Copy/constant propagation
* Constant folding
* Unreachable code
* Dead code elimination
* CSE
* Value numbering
* Partial redundancy elimination (PRE)
* Array bounds check elimination
* Control flow opts
* Type test elimination
* Range analysis (for eliminating safety checks)

Machine-level
* Hierarchical tiling-based register allocation
* Stack packing
* Instruction scheduling
* Machine idioms
* Code layout
* Optimized struct copying/initialization
* Optimized block copying/zeroing


Optimizations 


Interprocedural
* Inlining
* Tree shaking (instantiation and invocation)
* Devirtualization using class/method hierarchy analysis
* Hierarchy analyses extending to generics
* Stack allocation
* Constant propagation
* Null check elimination
* Interprocedural range analysis
* Return value optimization
* Bottom up summary information

Loop optimizations
* Loop invariant removal
* Strength reduction + induction variable elimination
* Loop unrolling
* Loop versioning

Managed-code specific
* Write barrier insertion + specialization
* Specialized type tests
* GC synchronization analysis
* Arithmetic check folding
* Runtime metadata elimination
* Runtime vtable merging
Image size


Midori x64 build 
* Approximately 500 programs 
* Includes: 


* Browser 
* Shell 
* SpecWeb 
* Tetris 
* Most tests 
                 Native/MSIL ratio   Native (bytes)   MSIL (bytes)
Release-x64      3.24                197,595,648      60,954,624
Release-x86      2.73                165,058,048      60,428,288
Release-Tegra3   2.33                111,294,976      47,695,872
Effect of generic sharing


Generic sharing over reference types: 23% reduction 
Add generic sharing over value types: 10% reduction 


Total reduction: 33% 


(This is for an older version of the system) 


Shared libraries 


Before shared generics and shared libraries, Midori was >800 Mbytes
(with much less functionality) 


Reduction from sequentially adding libraries one at a time: 


Base            48%
UserInterface   22%
Testing         16%


(This is from an older version of the system.) 


Midori is a feasible OS today because of shared libraries 


C# Integer Benchstone x64 


Times in seconds: lower is better. Ratios: higher is better. "?" marks a cell or name that is illegible in the source.

Benchmark (/O2)   JIT s    NGEN s   Feb-2013 s   JIT/Feb-2013   NGEN/Feb-2013
8queens.c         4.79     4.85     5            0.96           0.97
ackerman.c        5.33     4.65     ?            ?              1.03
addarra2.c        17.23    17.22    13.93        1.24           1.24
addarray.c        21.45    21.44    14.91        1.44           1.44
array1.c          8.42     8.43     7.86         1.07           1.07
array2.c          30.76    30.85    48.04        0.64           0.64
benche.c          15.6     14.79    ?            1.47           1.39
binserch.c        10.71    10.01    7.68         1.39           1.30
bubsort.c         17.75    17.73    13.82        1.28           1.28
bubsort2.c        13.81    14.59    10.82        1.28           1.35
csieve.c          ?        11.38    ?            1.25           1.24
fib.c             9.34     7.81     0.01         934.00         781.00
heapsort.c        ?        12.58    11.12        ?              1.13
iniarray.c        27.52    27.02    2.66         10.35          10.16
logicarr.c        22.42    22.46    16.81        1.33           1.34
midpoint.c        18.46    18.49    21.02        0.88           0.88
imulmtx.c         20.65    20.66    13.42        1.54           1.54
ndhrysto.c        27.69    28.64    17.59        1.57           1.63
permutat.c        8.14     10.14    7.21         1.13           1.41
pi.c              10.95    10.95    8.6          1.27           1.27
pule.c            14.5     14.64    12.77        1.14           1.15
quicksrt.c        12.86    12.71    11.27        1.14           1.13
shellsrt.c        24.4     24.26    20.86        1.17           1.16
sq_mtx.c          17.67    17.75    9.47         1.87           1.87
treesort.c        3.83     3.69     3.76         1.02           0.98
tree_ins.c        9.17     9.16     9.29         0.99           0.99
?_mtx.c           34.04    34.01    30.13        1.13           1.13
?                 14.07    14.08    11.36        1.24           1.24
Geomean           13.94    13.83    8.40         1.66           1.65
SPEC2K6 x64 /O2 vs. VS


[Table: SPEC CPU2006 INT and FP at /O2, comparing LKG3 with Feb-2013.
Columns: LKG3 seconds (lower is better), Feb-2013 seconds (lower is better), LKG3/Feb-2013 (higher is better), LKG3 bytes (lower is better), Feb-2013 bytes (lower is better), Feb-2013/LKG3 (lower is better).
CPU2006 INT rows: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk, Geomean.
CPU2006 FP rows: dealII, lbm, milc, namd, povray, soplex, sphinx3, Geomean.
The per-benchmark values cannot be matched to rows in this copy.]
Conclusions


Bartok is the compiler underlying Midori: 


Ahead-of-time compilation of C# 
Including generics, which are fully instantiated at compile time 


Features that reduce the memory and time overhead of managed 
code 

Shared libraries are required for Midori to be a viable OS 

If you have generics, you must have generic sharing 

Avoid peanut butter 


Highly optimizing compiler 
Further reading


http://midori/Midori%20Design%20Notes/Forms/AllItems.aspx:


MDN 95: Midori Code Sharing and Separate Compilation Model 
MDN 143: Bartok Generics 

MDN 225: Frozen Objects 

MDN 226: Polymorphic Hierarchy 


MDN 228: Implementing the Midori Asynchronous Programming Model


